Elastic energy of proteins and the stages of 
protein folding 

J. Lei* K. Huang* 
February 26, 2010 

Abstract 

We propose a universal elastic energy for proteins, which depends only 
on the radius of gyration R a and the residue number N. It is constructed 
using physical arguments based on the hydrophobic effect and hydrogen 
bonding. Adjustable parameters are fitted to data from the computer sim- 
ulation of the folding of a set of proteins using the CSAW (conditioned 
self-avoiding walk) model. The elastic energy gives rise to scaling rela- 
tions of the form R g ~ N u in different regions. It shows three folding 
stages characterized by the progression with exponents v = 3/5, 3/7, 2/5, 
which we identify as the unfolded stage, pre-globule, and molten globule, 
respectively. The pre-giobule goes over to the molten globule via a break 
in behavior akin to a first-order phase transition, which is initiated by a 
sudden acceleration of hydrogen bonding. 

1 Introduction 

The folding of a protein chain in water is driven mainly by the hydropho- 
bic and hydrogen-bonding interactions pQ. The general shape of the fold, 
or tertiary structure, is designed to have hydrophobic side chains buried 
in the interior of the protein, in order to avoid direct contact with wa- 
ter. However, the side chains are attached to a backbone that welcomes 
exposure to water, because of their need for hydrogen bonding. These op- 
posing tendencies reach mutual accommodation through the formation of 
secondary structures — alpha helices and beta sheets — which "use up" 
the hydrogen bonds on the backbone. This involves an intricate interplay 
between global geometry and local structure, and each protein seems to 
present special problems 2 . Proteins with different amino-acid sequences 
can invoke quite different folding mechanisms 3 , while proteins with high 
sequence similarity can end up with very different folds [H [5] . Neverthe- 
less, universal aspects do emerge, if one overlooks details and concentrate 
only on a few general properties. 
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An overall characteristic of the tertiary structure is the radius of gy- 
ration R g , the root-mean-square separation between residues. It serves 
as a length scale, whose behavior can be studied experimentally [51 El IE]- 
Stages in the folding process are characterized by scaling relations of the 
form R g ~ N", where N is the number of amino-acid residues. The "com- 
pactness index" v can be measured through small-angle x-ray scattering 
[9j. It can be derived from physical models based on intuitive reasoning, as 
pioneered by Flory [TO] and de Gennes [TT] in the theory of homopolymers. 

The unfolded protein chain, which is akin to a homopolymer, can be 
described by Flory's model based on SAW (self-avoiding walk) [TD]. One 
assumes a potential energy of the form a (i^/TV) + b (N 2 /R®^ , where o 
and b are temperature-dependent coefficients, and D is the spatial dimen- 
sion. The first term is a stretching energy associated with random walk, 
for which Rg scales like N. The second term arises from the excluded 
volume effect, and is proportional to N times the density. Minimizing the 
energy with respect to R g leads to v = 3/ (D + 2), which gives v = 3/5 
for D = 3. 

Hong and Lei [12] obtain v = 2/5 for the native state by statistical 
analysis of data from PDB (the Protein Data Bank). They also derive 
it by generalizing Flory's model. Through detailed arguments, they gen- 
eralize Flory's stretching energy to iJg/jV^ 2//a ' -1 , where a is the fractal 
dimension of the system. This leads to v = (a + 2) / [a (D + 2)], which 
yields v = (a + 2) / (5a) for D — 3. Taking the fractal dimensions to be 
a = 1,2,3 for polymer in good solvent, protein native state, polymer in 
poor solvent, respectively, one obtains 3/5, 2/5, 1/3 for the respective in- 
dices. The first case reduces to Flory's SAW model. For protein in the 
native state, the fractal dimension a — 2 can be deduced from PDB, and 
the index v = 2/5 implies Hooke's law (E — Eo) ~ R 2 ,. We note that the 
native protein is less compact than the collapsed polymer, whose index is 
1/3|11|. This is mainly because generally proteins are not fully hydropho- 
bic. The proteins with 70% hydrophobic residues do have indices close 
to 1/3[T2]. Furthermore, native proteins are not well-packed because the 
secondary structures tend to create interior free volumes [131 114] . 

Ptitsyn [15] has proposed a "molten globule" prior to the native state, 
in which the tertiary structure has taken shape, together with a large 
fraction of the final secondary structures. The main difference from the 
native state lies in orientations of side chains, which gradually lock into 
native contacts in a time scale of seconds. Thus the molten globule is 
almost as compact as the native state, and should have v = 2/5. This 
is supported by data from x-ray scattering [15] . In some proteins there 
is evidence for a pre-globule stage [TB], which goes over to the molten 
globule in a sudden jump. Its observed index is v — 0.411 ± 0.016[T7]. 

In the present investigation, we try to verify these universal features, 
and understand them in a unified picture based on the twin actions of 
the hydrophobic force and hydrogen bonding. We do this by generalizing 
Flory's potential energy to a universal elastic energy for proteins, via 
analysis and interpretation of data from computer simulations. 
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2 Protein folding in the CSAW model 



We simulate the folding of several proteins using the CSAW (conditioned 
self-avoiding walk) model [TS]. A conformation of the protein is specified 
by a set of torsion angles, and side chains are approximated by hard 
spheres. The unfolded chain is represented by SAW, and folding comes 
about through conditions that create a bias in SAW. The "self-avoidance" 
here means that all atoms on the backbone, as well as the side chains, 
are treated as hard spheres with appropriate diameters, and they are 
forbidden to overlap one another. 

One begins with a SAW of TV steps. A trial update is generated by 
the pivot algorithm [TS], and is accepted with a probability given by a 
Metropolis MC (Monte-Carlo) algorithm, based on the conformation en- 
ergy 

e = — g\K\ - g-zK%. (1) 

The two terms here correspond to the hydrophobic effect and hydrogen 
bonding, respectively. The quantity K\ is a total hydrophobic shielding 
number, which measures how well hydrophobic side chains are shielded 
from water by neighboring protein atoms, and is defined as follows. The 
i'th residue has a hydrophobicity hi given by experiments .20 , and a con- 
tact number hi, which is the number of nearest-neighbors of its side chain, 
in the existing conformation, and K\ = ^2- =1 hiki. Here, we adopt the 
HP model for the primary sequence, and assign hi = 1 for the hydropho- 
bic residues (L, P, M, W, A, V, F, I), and hi = for the others [21"]. The 
quantity K2 is the total number of internal hydrogen bonds, which are 
deemed to be in existence whenever two legitimate partners have relative 
positions that conform to the bond separation and angle. Other effects 
such as electrostatic and van der Waals interactions are ignored; but they 
can be easily included if desired. The MC updates eventually generate a 
sequence of conformations that form a canonical ensemble with respect 
to the conformation energy e. The action of water is described entirely 
through the hydrophobic energy — giifi, and the random impacts implicit 
in random walk. 

We assume that the parameters gi, g-z take the same values for all 
proteins, which are determined in an earlier study of polyalanine 22 . 
The temperature is fixed, and implicit in the values of gi, 32, but not yet 
calibrated in the absolute scale. 



Protein name 


ID 


N 


h 


Structure 


Polyalanine 


ala20 


20 


1.000 


1 alpha helix 


Antimicrobial LCI 


2b9k 


47 


0.383 


1 beta sheet 


Tedamistat 


3ait 


74 


0.351 


2 sheets 


Myoglobin 


lmbs 


153 


0.379 


8 helices 


Asparagine 
synthetase 


lias 


330 


0.397 


11 helices 
8 sheets 



Table 1: Proteins simulated. N = number of residues, h = fraction of hydropho- 
bic residues. 

We simulate five proteins, as listed in Table 1. For each protein, folding 
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starts from an unfolded state created by heating the native state to a high 
temperature, and then quenched to a fixed low temperature. The folding 
process runs for 4 x 10 6 MC steps, with snapshots taken every 1000 steps. 
For each protein, the entire procedure is repeated 30 times to generate 
30 folding trajectories, or an ensemble of 1.2 x 10 4 conformations. In the 
case of myoglobin, folding is extended to a total of 2.4 x 10 7 MC steps. 




MC Steps 



Figure 1: Average radius of gyration and number of internal hydrogen bonds 
vs. MC steps from the simulations of four globular proteins, listed with residue 
numbers in parenthesis, in the order of the curves from top down. 

Fig. [T] shows the average radius of gyration and number of internal 
hydrogen bonds as a function of MC steps from the simulations of four 
globular proteins. From Fig. Q3 we see that the radius of gyration de- 
creases in the early stage, and form a plateau in a late stage (except the 
protein lias whose simulation has not yet reached the plateau stage). This 
is consistent with experimental observations 6 7 , 8] . Existence of such sta- 
ble collapsed states have been found in recent AFM experiments 23 . The 
number of internal hydrogen bonds shows a continuous increase during 
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the whole process. 

We compute an average potential energy E(R g ,N), defined as follows. 
For each N, we take the ensemble average of the model energy e over all 
conformations along the folding trajectory that share the same value of R g . 
Thus, —dE/dRg gives the pressure-force experienced by the protein as a 
function of radius, and can be observed in force spectroscopy experiments 
[241 125]. In this sense we can call E(R g ,N) an "elastic energy". The 
simulation results are shown in Fig[5] 

We can display universal features in the data by rescaling the variables 
(Fig. [3J. The scales chosen are based on physical pictures, and their 
validity is to be judged by goodness of fit. First, we rescale the energy by 
a factor N 4 ^ 5 . This is chosen because in the native state the scaling laws 
E ~ N 4/5 and R g ~ N 2/5 lead to Hooke's law E ~ R 2 g . On the horizontal 
axis of the plot, we rescale the radius by a factor N" to produce two 
separate plots, with v = 3/5 to examine the unfolded region, and v — 2/5 
to examine the collapsed region. The rescaled graphs are shown in Fig(3ja) 
for the 3/5 scaling, and FigOTJb) for the 2/5 scaling. As we can see, the 
data do exhibit universal behaviors in the respective regions, except for 
ala20, which fails to scale in the collapsed region. This exceptional protein 
has hydrophobic fraction h = 1, whereas the others have average h w 0.37. 
This exception, in fact, shows the relevance of hydrophobicity. 




20 40 60 80 100 120 



Figure 2: Average potential energy vs. radius of gyration, from computer simu- 
lations (points) and universal potential ((4]) (solid curves) of five proteins, listed 
with residue numbers in parenthesis, in the order of the curves from top down. 



3 A universal elastic energy 

We use physical arguments to suggest an analytical form of a universal 
elastic energy, and fit undetermined parameters to simulation data. Let 
us start from the unfolded chain, which is indicated in Fig[3^a). Assuming 
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Figure 3: (a) Rescaling the data to exhibit universality in the unfolded stage, 
(b) A different rescaling reveals universality in a collapsed region. The case 
ala20 (green pentagon) is exceptional, being completely hydrophobic. 



a power law E/N 4/5 ~ (R g /N 3/5 ) p , we have E ~ R p N (4 - 3p)/5 . In the 
hypothetical limit of a completely extended chain, we should have R g ~ TV 
and E ~ N. This determines p = 1/2, and hence £ ~ (Af/? 9 ) 1/2 . This 
scaling law describes the hydrophobic energy, since hydrogen bonding is 
not significant in this stage. Now we fix the reference point of energy by 
taking E = for a completely extended chain, which is the convention 
used in CSAW simulations. Thus we arrive at the following potential 
energy in the unfolded region: 

Ei(Rg,N) = aN 4/5 + bN 1/2 Rl /2 y (2) 

where a and b are parameters. As the chain folds, and R g decreases, the 
excluded volume effect becomes important. To take this into account, we 
add to Ei a Flory term N 2 /R 3 . Thus, we replace E\ with 

E 2 {R g ,N) = a'N 4/5 + b'N 1/2 Rl /2 + c'N 2 /R 3 g , (3) 

where a', b' , c' are new parameters. 

The energy E2 has a local minimum corresponding to a metastable 
state, with scaling law R g ~ N 3/7 , and E ~ N 5/7 . The index v = 3/7 is 
consistent with the measured value 0.411 ±0.016 for the natively unfolded 
proteins in pre-globule state |17p1 . 

The universality in the collapsed region (Fig. EJb)) suggests a power 
law E/N 4/5 ~ (R g /N 2/5 ) q , which implies E ~ R q g N {A - 2q)/5 . Fitting these 
forms to the pre-globule state, we find q = -3, and hence E ~ N 2 /R? g . 
This energy involves hydrogen bonding. We note that it has the same 

1 Because of the hydrogen bonding attraction in the pre-globule, the experimental data 
shows an index closer to that for the molten globule (2/5 = 0.4), rather than the 3/7 = 0.43. 
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form as Flory's excluded-volume energy, and thus is already taken into 
account in E?.. 

When a system of hard spheres on a chain collapses, overlaps can only 
be avoided through elaborate rearrangements, and the associated thermal 
energy depends on the intricate structure of the chain, and can only be 
treated phenomenologically. We do this by allowing the excluded volume 
c' to depend on N and R g . Since the excluded volume effect should 
be universal for proteins at the molten globule stage, with scaling law 
R g ~ jV 2//5 , c' must depend on R g through the scaled radius p = R g /N 2 ^ 5 . 
Thus we arrive at a universal elastic energy given by 

E(R g , N) = fc iV 4/5 + fci (NR g ) 1/2 + N 4/5 U(p). (4) 

Here, U (p) is a short-ranged potential, which we expand in an inverse 
power series of its argument: 

U(p) = ]T k n p- n . (5) 

n>3 

In the power series ©, the terms with larger exponents describe 
higher-order short-range interactions, which depend on the detailed struc- 
ture, and are chain specific. To obtain a universal elastic energy with 
the least number of parameters, we limit ourselves to odd powers 3 < n 
< 15. This gives rise 9 adjustable parameters, which are fitted to simula- 
tion data, excluding those from ala20, and lias. The former is excluded 
because it is atypical, being 100% hydrophobic; the latter, because its 
simulation has not yet reached the pre-globule. With R g measured in 
angstroms, the parameters are as follows: fco = —1.50, and for odd n from 
1 to 15, kn = 2" x {0.43, 2.53, -34.49, 155.87, -326.50, 350.09, -187.19, 
39.59}. 

4 The stages of protein folding 

Graphs of the elastic energy are shown in FigO The fits are not ideal for 
ala20, presumably because the protein is all hydrophobic, nor for lias, 
because the simulation in this case is far from complete. 

We refer to the curve of lias for illustration. The region with large 
R g corresponds to the unfolded stage, which is under pressure to collapse, 
since 8E/dR g > 0. The only stable point on the curve is the lowest 
minimum, which corresponds to the most compact state. The states with 
smaller R g have rapidly increasing energy because of the excluded-volume 
effect. There is a flat shoulder corresponding to the molten globule, which 
can exist in neutral equilibrium. The radii of the molten globule and the 
most compact state are plotted in Fig(4ja), with R g ~ 1.66iV 2/ ' 5 for the 
most compact state, and R g ~ 3.01N 2 ^ 5 for the molten globule. We can 
see from[4][a) that the above theoretical results agree well with the exper- 
imental data. The energy of the molten globule, shown in Fig(4^b), obeys 
(E-E ) ~ N i/S , which implies (E — Eq) ~ R 2 g . This agrees with the 
Hookes-law behavior in the collapsed state. In summary, the three fold- 
ing stages — unfolded, pre-globule, molten globule — are characterized 
by the progression v = 3/5, 3/7, 2/5 of the compactness index. 
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Figure 4: (a) Radius of gyration for molten globule: R g = 3.017V 2 / 5 , and the 
most compact state: R g = 1.66/V 2 / 5 . Experimental data are also shown, with 
37162 proteins in PDB (dots)[T2], the most compact proteins from [T^ (circles), 
and the proteins in molten globule state |26j (squares) . (b) Energy of equilibrium 
state as a function of N: E = 12.45 — 0.80./V 4 / 5 . The solid points on the curves 
show that numerical results from the potential (j4|. 
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MC Steps 



Figure 5: Growth of hydrogen bonds in the simulation of myoglobin. The onset 
of the transition from pre-globule to molten globule is marked by a sudden jump 
in the growth rate. 



The rate of hydrogen bonding increases in the successive folding stages, 
as illustrated in Fig[5]from the simulation of myoglobin. The transition 
from the pre-globule to the molten globule is marked by a sudden accel- 
eration of hydrogen bonding. This initiates the analog of a first-order 
phase transition. Like that in macroscopic matter, the volume is being 
compressed at constant temperature, and latent heat is released, since the 
transition connects two states of different energy. However, in a small sys- 
tem such as the protein, there is no clear separation of coexisting phases. 

The subsequent evolution from the molten globule to the native state 
cannot be described in the present simulation, since it involves the locking 
of side chains, and we have approximate them with hard spheres. As we 
learn from experimental data, however, v should remain unchanged. 



5 Outlook 

The elastic energy here is constructed at a fixed temperature. Work is in 
progress to extend it to a range of temperatures, with a view of obtaining 
a universal equation of state for proteins. Also under consideration is the 
use of the potential in a kinetic equation to compute the lifetimes of the 
various stages of protein folding. Although we concentrate on universal 
properties here, the CSAW model actually yields results for individual 
proteins, which could be analyzed for specificity. The CSAW model is 
highly flexible, and amenable to refinements, such as realistic simulation 
of side chains, and inclusion of other interactions not yet considered. 

This work is supported in part by the National Natural Science Foun- 
dation of China (NSFC10601029). 



9 



References 

[1] Branden C. Tooze J. Introduction to Protein Structure, Garland Pub- 
lishing, New York (1998). 

Shakhnovich E. Chem. Rev. 69(2006): 1559. 

Daggett V. Fersht A. R. Trends Biochem Sci. 28(2003):18. 

Alexander P.A., et al. Biochemistry 44(2005)(14045). 

He Y., et al. Biochemistry 46(2005): 14055. 

Akiyama A., et al. Proc. Natl. Acad. Set. USA 99(2002):1329. 

Uzawa T., et al. Proc. Natl. Acad. Sci. USA 101(2004):1171. 

Kimura T., et al. Proc. Natl. Acad. Sci. USA 102(2005):2748. 

Huang K. Lectures on Statistical Physics and Protein Folding, World 
Scientific, Singapore (2005), Section8.4. 

Flory P. Principles of Polymer Chemistry, Cornell University Press, 
London (1953). 

de Gennes P. G. Scaling Concepts in Polymer Physics, Cornell Uni- 
versity Press, Ithaca(1979). 

Hong L. Lei J. J. Polymer Sci. B 47(2009) :207. 

Arteca, G. A. Phys. Rev. E. 51(1993):2600. 

Liang, J. Dill, K. A. Biophys. J. 81(2001)751. 

Ptitsyn O. B. Protein Folding T. E. Creighton T. E. Ed. W.H. Free- 
man & Co., New Yirk (1992). 

Uversky V. N. Ptitsyn O. B. J. Mol. Biol. 255(1996):215. 
Uversky V. N. Protein Sci. 11(2002):739. 
Huang K. Biophys. Rev. Lett. 2(2007) :139. 
Li B., Madras N. Sokal A. D. J. Stat. Phys. 80(1995):661. 
Kyte, J. Doolittle, R. A. J. Mol. Biol. 157(1982):105. 
Garrett, R. H. Grisham, C. M. Biochemistry, Thomson(1999):84-85. 
Lei J. Huang K. Eur. Phys. J. E 27(2008):197. 

Garcia-Manyes, S., et al. Proc. Natl. Acad. Sci. USA 
106(2009):10534. 

Cecconi, C, et al. Science 309(2005):2057. 

Fernandez, J. M. Li, H. Science 303(2004): 1674. 

Tcherkasskaya, O. Uversky, V. N. Proteins Struct. Fund. Genet. 
44(2001):244. 



10 



